---
title: "Data Mining Project"
subtitle: "California Wildfire Prediction"
author: "Shruti Elangovan, Anurag Mallik, Diksha Phuloria"
date: "04/07/2025"
output: 
  html_notebook:
    toc: False
---

# Introduction and Background
> Wildfires are growing in both frequency and intensity. This worrying trend is largely linked to global warming, human activities that impact the environment, and natural ecological shifts. The significant damage caused by these events underscores the urgent need for reliable tools that can predict wildfire risk, allowing for better preparation and response.
>
> This project applies predictive algorithms to forecast two wildfire characteristics in California: the duration of active burning (persistence) and the potential for fire expansion (spread risk). Using a dataset of historical wildfire incidents and associated environmental variables, we developed and evaluated predictive models for both attributes. The results suggest that such models can offer useful insights for proactive mitigation and resource allocation in this fire-prone region.

```{r, message=FALSE, warning=FALSE}
library(dplyr)
library(tidyr)
library(glmnet)
library(BART)
library(randomForest)
library(ggplot2)
library(readxl)
library(plotly)
library(ggcorrplot)
library(patchwork)
```

```{r}
data <- read.csv("../Data/WildFire_DataSet.csv")
```

```{r}
data |> head(5)
```

```{r}
glimpse(data)
```

## Data Cleaning 

```{r, fig.height=12, fig.width=8}
library(naniar)
gg_miss_var(data) + labs(title = "Missing Data by Variable", y="Missing Counts")
```
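
> `gg_miss_var()` counts `NA` values only. Character columns read with `read.csv()` keep empty strings as `""` rather than `NA`, so blank entries in columns such as `incident_type` or `fuel_type` would not show up in the plot above. The following is a minimal base-R sketch, on a made-up toy data frame, of counting both kinds of missingness; `toy` and `count_missing` are hypothetical names, not part of the project code.

```r
# Toy data standing in for the wildfire data (values are made up)
toy <- data.frame(
  fuel_type = c("Forest", "", "Shrubland", NA),
  ndvi      = c(0.87, NA, 0.09, 0.02),
  slope     = c(39, 2, 3, 11),
  stringsAsFactors = FALSE
)

# Count entries that are NA or, for character columns, blank/whitespace-only
count_missing <- function(df) {
  vapply(df, function(col) {
    blank <- if (is.character(col)) {
      !is.na(col) & trimws(col) == ""
    } else {
      rep(FALSE, length(col))
    }
    sum(is.na(col) | blank)
  }, integer(1))
}

missing_counts <- count_missing(toy)
print(missing_counts)
```

> One option, if blanks should be treated as missing throughout, is to pass `na.strings = c("NA", "")` to `read.csv()` so that both are read as `NA`.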

```{r}
head(data)
```

```{r}
# Total and average fire duration per start year
year_fire <- data %>%
  group_by(Year_Started) %>%
  summarise(Total_Fire_Duration = sum(Fire_Duration, na.rm = TRUE), num_fire = n()) %>%
  mutate(avg_fire_dur = Total_Fire_Duration / num_fire) %>%
  dplyr::ungroup()

year_fire |> head(15)
```

## Exploratory Data Analysis

```{r, fig.width=10, warning=FALSE}
# Create each plot
p1 <- ggplot(data, aes(x = Year_Started)) +
  geom_bar(fill = "firebrick") +
  theme_linedraw() +
  labs(title = "Number of Fires per Year", x = "Year", y = "Count")
  

p2 <- ggplot(year_fire, aes(x = as.factor(Year_Started), y = avg_fire_dur)) +
  geom_col(fill = "red") +
  labs(
    title = "Average Fire Duration by Year",
    x = "Year",
    y = "Average Fire Duration (Days)"
  ) +
  scale_x_discrete(limits = as.character(2013:2023)) +
  theme_linedraw() +
  theme(plot.title = element_text(hjust = 0.5))

# Combine them side by side
p1 + p2
```

> The bar charts display the number of fires and the average fire duration (in days) per year from 2013 to 2022. The duration chart shows a dramatic spike in 2017 and 2018, indicating that fires starting in those years lasted considerably longer on average than fires in other years in the dataset.
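>
> The `year_fire` table computes the per-year average as the summed duration (ignoring `NA`) divided by the number of rows, so fires with a missing duration still count in the denominator. The sketch below, on made-up values, contrasts that with averaging over fires whose duration is known; `toy_fires` and `year_summary` are hypothetical names, and whether the real data has missing durations is not something this sketch assumes.

```r
# Toy data (made up): one 2017 fire has no recorded duration
toy_fires <- data.frame(
  Year_Started  = c(2017, 2017, 2017, 2018, 2018),
  Fire_Duration = c(70, 0, NA, 4, 2)
)

# Per-year total, number of rows, and number of rows with a known duration
total_dur <- tapply(toy_fires$Fire_Duration, toy_fires$Year_Started, sum, na.rm = TRUE)
n_fires   <- tapply(toy_fires$Fire_Duration, toy_fires$Year_Started, length)
n_known   <- tapply(!is.na(toy_fires$Fire_Duration), toy_fires$Year_Started, sum)

year_summary <- data.frame(
  Year_Started = as.integer(names(total_dur)),
  total_dur    = as.numeric(total_dur),
  avg_all_rows = as.numeric(total_dur / n_fires),  # same rule as year_fire
  avg_known    = as.numeric(total_dur / n_known)   # ignores rows with NA duration
)

print(year_summary)
```

> The two averages agree whenever a year has no missing durations; they diverge only for years where some durations are `NA`.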

```{r}
# Compute average acres burned per fuel type
fuel_stats <- data %>%
  group_by(fuel_type) %>%
  summarise(avg_acres_burned = mean(incident_acres_burned, na.rm = TRUE)) %>%
  arrange(desc(avg_acres_burned))

# Plot
ggplot(fuel_stats, aes(x = reorder(fuel_type, -avg_acres_burned), y = avg_acres_burned)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  theme_linedraw() +
  labs(title = "Average Acres Burned by Fuel Type",
       x = "Fuel Type", y = "Average Acres Burned") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

```

> This bar chart displays the average acres burned by different fuel types. Forests have the highest average area burned, followed by shrubland and grassland. Non-burnable and barren areas show minimal fire impact.


```{r}
data_new = data
data_new %>%
  count(fuel_type, sort = TRUE) %>%
  ggplot(aes(x = reorder(fuel_type, -n), y = n)) +
  geom_bar(stat = "identity", fill = "forestgreen") +
  labs(title = "Number of Incidents by Fuel Type", x = "Fuel Type", y = "Count") +
  theme_linedraw() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
```
> This bar chart shows the number of fire incidents by fuel type. Urban, shrubland, and grassland areas report the highest incident counts. Although forests have fewer incidents, forest fires tend to burn larger areas than fires in other fuel types.
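>
> The comparison between incident counts and average burned area can be read off a single table. The sketch below builds one with base R on made-up values; `toy_fuel` and `fuel_summary` are hypothetical names. The median column is an addition for illustration only: a mean can be dominated by a few very large fires, and the median gives a second view of what a typical fire looks like.

```r
# Toy data (made up) with the two columns used in the charts above
toy_fuel <- data.frame(
  fuel_type = c("Forest", "Forest", "Urban", "Urban", "Urban", "Grassland"),
  incident_acres_burned = c(6896, 2956, 75, 122, 35, 274),
  stringsAsFactors = FALSE
)

# Per-fuel-type incident count, mean, and median of acres burned
n_by_fuel      <- table(toy_fuel$fuel_type)
mean_by_fuel   <- tapply(toy_fuel$incident_acres_burned, toy_fuel$fuel_type, mean)
median_by_fuel <- tapply(toy_fuel$incident_acres_burned, toy_fuel$fuel_type, median)

fuel_summary <- data.frame(
  fuel_type    = names(mean_by_fuel),
  n            = as.integer(n_by_fuel[names(mean_by_fuel)]),
  mean_acres   = as.numeric(mean_by_fuel),
  median_acres = as.numeric(median_by_fuel),
  stringsAsFactors = FALSE
)

# Largest average burned area first
fuel_summary <- fuel_summary[order(-fuel_summary$mean_acres), ]
print(fuel_summary)
```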

## Geographical Scatter Plot

> The map below displays the geographic distribution of wildfire ignition points across California, marked by red dots. Clusters of points show where wildfire starts are concentrated and therefore which areas are most prone to fire activity.

```{r,fig.width=10, warning=FALSE}
# Marker size grows with the square root of burned area, capped at 15
df_train <- data %>%
  mutate(scaled_size = pmin(sqrt(incident_acres_burned) * 0.2, 15))

# Create scatter map with better visibility
fig_location <- plot_ly(
  data = data,
  type = "scattermapbox",
  lon = ~incident_longitude,
  lat = ~incident_latitude,
  text = ~paste("Acres Burned:", incident_acres_burned),
  marker = list(color = "red"),
  mode = "markers"
) %>%
  layout(
    title = "Fire Locations",
    mapbox = list(
      style = "carto-positron",
      zoom = 5,  # Adjust zoom level
      center = list(
        lon = mean(data$incident_longitude, na.rm = TRUE),
        lat = mean(data$incident_latitude, na.rm = TRUE)
      )
    )
  )

# Show the plot
fig_location
```
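
> The `scaled_size` column in the chunk above is meant to be capped at 15. `pmin()` takes the element-wise minimum across its arguments, so it only caps a vector when the upper bound is passed as a second argument; with a single argument it returns the values unchanged. A minimal sketch on made-up acreages (the vector `acres` is hypothetical):

```r
# Made-up burned areas spanning several orders of magnitude
acres <- c(0, 37, 407, 6896, 250000)

# Square-root scaling keeps large fires from dominating; pmin() caps at 15
scaled_size <- pmin(sqrt(acres) * 0.2, 15)

# With a single argument, pmin() leaves the values as they are (no cap)
uncapped <- pmin(sqrt(acres) * 0.2)

print(round(scaled_size, 2))
print(round(uncapped, 2))
```

> A column built this way could be mapped to marker size in the map if point size should reflect burned area; the map above uses a fixed marker style instead.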

## Wildfire Analysis by Weather Factors

```{r, fig.width=12}
# Create individual scatter plots
p1 <- ggplot(data, aes(x = avg_windspeed, y = incident_acres_burned)) +
  geom_point(color = "blue") +
  theme_grey()

p2 <- ggplot(data, aes(x = avg_temp, y = incident_acres_burned)) +
  geom_point(color = "red") +
  theme_grey()

p3 <- ggplot(data, aes(x = avg_precipitation, y = incident_acres_burned)) +
  geom_point(color = "green4") +
  theme_grey()

# Arrange plots in a row with consistent sizing and shared Y axis
fig <- subplot(p1, p2, p3, nrows = 1, shareY = TRUE, titleX = TRUE, titleY = TRUE, widths = c(0.33, 0.33, 0.33)) %>%
  layout(
    title = list(
      text = "Wildfire Analysis by Weather Factors",  
      x = 0.5,    
      xanchor = "center",
      font = list(size = 20)  
    ),
    showlegend = TRUE,
    margin = list(l = 50, r = 50, b = 100, t = 100),
    xaxis = list(title = "Avg Wind Speed", zeroline = FALSE),
    xaxis2 = list(title = "Avg Temperature", zeroline = FALSE),
    xaxis3 = list(title = "Avg Precipitation", zeroline = FALSE),
    yaxis = list(title = "Acres Burned")
  )

fig
```
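
> Acres burned spans several orders of magnitude, so a few very large fires can dominate scatter plots on the raw scale. The sketch below, on made-up values, shows two common ways to look at such a relationship: a `log1p()` transform of the response and a rank-based (Spearman) correlation. `toy_weather` is a hypothetical name, and the numbers are illustrative only; they say nothing about the actual strength of these relationships in the wildfire data.

```r
# Toy data (made up): burned area grows roughly exponentially with wind speed
toy_weather <- data.frame(
  avg_windspeed         = c(3.6, 6.5, 8.8, 9.9, 13.2, 14.9),
  incident_acres_burned = c(35, 75, 274, 407, 2956, 6896)
)

# log1p(x) = log(1 + x), which stays defined when acres burned is 0
toy_weather$log_acres <- log1p(toy_weather$incident_acres_burned)

# Linear correlation on the raw and log scales
pearson_raw <- cor(toy_weather$avg_windspeed, toy_weather$incident_acres_burned)
pearson_log <- cor(toy_weather$avg_windspeed, toy_weather$log_acres)

# Rank correlation is unchanged by any monotone transform of either variable
spearman <- cor(toy_weather$avg_windspeed, toy_weather$incident_acres_burned,
                method = "spearman")

print(c(pearson_raw = pearson_raw, pearson_log = pearson_log, spearman = spearman))
```

> The dataset already includes a `log_acres_burned` column, which could be plotted on the y-axis in place of `incident_acres_burned` to get the log-scale view directly.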

## Correlation Heat Map

> The heatmap shows pairwise Pearson correlations between features related to wildfire behavior and environmental conditions, computed on rows with no missing values in any selected column. Notably, minimum temperature shows a strong positive correlation with maximum temperature (0.84) and a strong negative correlation with fire duration (-0.78).

```{r, fig.height=10, fig.width=10}
# Select the numeric columns to correlate
df_cor <- data %>%
  select(Fire_Duration, incident_acres_burned, min_temp, max_temp, avg_temp, avg_windspeed,
         avg_precipitation, elevation, aspect, slope, ndvi, avg_pdsi, avg_spi30d, ndmi)

# Compute correlation matrix
cor_matrix <- cor(df_cor, use = "complete.obs")

# Plot correlation heatmap
ggcorrplot(cor_matrix, 
           method = "square", 
           type = "lower", 
           lab = TRUE, 
           lab_size = 5, 
           colors = c("blue", "white", "red"),
           title = "Feature Correlation Heatmap") +
  theme_linedraw()+
  theme(
    axis.text.x = element_text(face = "bold", size = 12, angle = 90),
    axis.text.y = element_text(face = "bold", size = 12),
    plot.title = element_text(face = "bold", size = 20, hjust = 0.5)
  )
```
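
> With `use = "complete.obs"`, `cor()` drops every row that has an `NA` in any selected column before computing any coefficient, whereas `use = "pairwise.complete.obs"` drops rows separately for each pair of variables. The sketch below, on a made-up three-column data frame, shows the difference and then lists variable pairs by absolute correlation; `toy_cor` and `strongest` are hypothetical names.

```r
# Toy data (made up): only ndvi has a missing value
toy_cor <- data.frame(
  min_temp = c(-0.3, 26.0, -0.2, 12.3, 4.3, 14.3),
  max_temp = c(25.2, 27.5, 15.2, 30.4, 26.5, 30.9),
  ndvi     = c(0.87, NA, 0.09, 0.02, 0.14, 0.28)
)

# Listwise deletion: every coefficient uses the same complete rows
cor_listwise <- cor(toy_cor, use = "complete.obs")
n_listwise   <- sum(complete.cases(toy_cor))

# Pairwise deletion: each coefficient uses all rows available for that pair
cor_pairwise <- cor(toy_cor, use = "pairwise.complete.obs")

# Rank the unique variable pairs by absolute correlation (lower triangle only)
idx <- which(lower.tri(cor_listwise), arr.ind = TRUE)
strongest <- data.frame(
  var1 = rownames(cor_listwise)[idx[, "row"]],
  var2 = colnames(cor_listwise)[idx[, "col"]],
  r    = cor_listwise[idx],
  stringsAsFactors = FALSE
)
strongest <- strongest[order(-abs(strongest$r)), ]

print(n_listwise)
print(strongest)
```

> Listwise deletion keeps the matrix internally consistent, at the cost of discarding rows; checking `sum(complete.cases(df_cor))` shows how many incidents actually contribute to the heatmap.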
